Decision Tree

Roadmap

(1) Embedding Numerous Features: Kernel Models
(2) Combining Predictive Features: Aggregation Models

Lecture 8: Adaptive Boosting
optimal re-weighting for diverse hypotheses and adaptive linear aggregation to boost ‘weak’ algorithms

Lecture 9: Decision Tree
• Decision Tree Hypothesis
• Decision Tree Algorithm
• Decision Tree Heuristics in C&RT
• Decision Tree in Action

(3) Distilling Implicit Features: Extraction Models
Decision Tree: Decision Tree Hypothesis

What We Have Done

blending: aggregate after getting $g_t$;
learning: aggregate as well as getting $g_t$

aggregation type    blending            learning
uniform             voting/averaging    Bagging
non-uniform         linear              AdaBoost
conditional         stacking            Decision Tree

decision tree: a traditional learning model that
realizes conditional aggregation


Decision Tree for Watching MOOC Lectures 


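[Figure: decision tree for watching MOOC lectures; root node asks "quitting time?"]

$$G(\mathbf{x}) = \sum_{t=1}^{T} q_t(\mathbf{x}) \cdot g_t(\mathbf{x})$$

• base hypothesis $g_t(\mathbf{x})$: leaf at end of path $t$, a constant here
• condition $q_t(\mathbf{x})$: $\llbracket \text{is } \mathbf{x} \text{ on path } t\text{?} \rrbracket$, usually with simple internal nodes

decision tree: arguably one of the most
human-mimicking models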


Recursive View of Decision Tree 
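Path View: $G(\mathbf{x}) = \sum_{t=1}^{T} \llbracket \mathbf{x} \text{ on path } t \rrbracket \cdot \text{leaf}_t(\mathbf{x})$

Recursive View

$$G(\mathbf{x}) = \sum_{c=1}^{C} \llbracket b(\mathbf{x}) = c \rrbracket \cdot G_c(\mathbf{x})$$

• $G(\mathbf{x})$: full-tree hypothesis
• $b(\mathbf{x})$: branching criteria
• $G_c(\mathbf{x})$: sub-tree hypothesis at the $c$-th branch
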


tree = (root, sub-trees), just like what 
your data structure instructor would say :-) 
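
The recursive view maps naturally onto a recursive data structure. A minimal Python sketch, assuming hypothetical names (`Leaf`, `Node`, `predict`) and a made-up two-level tree whose features `hour` and `has_deadline` are purely illustrative and not taken from the lecture:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    value: object  # constant base hypothesis returned at the end of a path


@dataclass
class Node:
    branch: Callable[[dict], int]                # b(x): returns a branch index c in {1, ..., C}
    subtrees: Dict[int, Union["Leaf", "Node"]]   # G_c for each branch index c


def predict(tree, x):
    """Evaluate G(x): follow the single branch c with b(x) = c, then recurse into G_c."""
    if isinstance(tree, Leaf):
        return tree.value
    return predict(tree.subtrees[tree.branch(x)], x)


# Hypothetical tree: (root, sub-trees), where one sub-tree is itself a tree.
tree = Node(
    branch=lambda x: 1 if x["hour"] < 18 else 2,
    subtrees={
        1: Leaf(False),
        2: Node(
            branch=lambda x: 1 if x["has_deadline"] else 2,
            subtrees={1: Leaf(False), 2: Leaf(True)},
        ),
    },
)

late_and_free = predict(tree, {"hour": 20, "has_deadline": False})
early = predict(tree, {"hour": 10, "has_deadline": False})
```

Because exactly one indicator $\llbracket b(\mathbf{x}) = c \rrbracket$ is 1, the sum in the recursive view reduces to following a single branch, which is what `predict` does.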


Disclaimers about Decision Tree 


• human-explainable: widely used in business/medical data analysis
• simple: even freshmen can implement one :-)
• efficient in prediction and training

However...

• heuristic: mostly little theoretical explanation
• heuristics: ‘heuristics selection’ confusing to beginners
• arguably no single representative algorithm


decision tree: mostly heuristic 
but useful on its own | 


Fun Time 


The following C-like code can be viewed as a decision tree of three leaves.

if (income > 100000) return true;
else {
    if (debt > 50000) return false;
    else return true;
}

What is the output of the tree for (income, debt) = (98765, 56789)?

(1) true
(2) false
(3) 98765
(4) 56789













Reference Answer: (2) 


You can simply trace the code. The tree
expresses a complicated boolean condition
[income > 100000 or debt ≤ 50000].
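
The trace can also be checked mechanically. A small Python sketch, assuming hypothetical function names `tree_output` and `flat_condition`, that mirrors the C-like tree and compares it with the single boolean condition:

```python
def tree_output(income, debt):
    """Three-leaf tree from the question, written as nested branches."""
    if income > 100000:      # root node
        return True          # leaf 1
    if debt > 50000:         # second internal node
        return False         # leaf 2
    return True              # leaf 3


def flat_condition(income, debt):
    """The same hypothesis written as one boolean expression."""
    return income > 100000 or debt <= 50000


answer = tree_output(98765, 56789)

# Compare both forms on a small grid that includes the boundary values.
grid = [0, 50000, 50001, 98765, 100000, 100001]
agree = all(
    tree_output(i, d) == flat_condition(i, d) for i in grid for d in grid
)
```

Note that the second branch returns true exactly when debt is at most 50000, so the boundary case debt = 50000 falls on the true side.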
Decision Tree: Decision Tree Algorithm

A Basic Decision Tree Algorithm

$$G(\mathbf{x}) = \sum_{c=1}^{C} \llbracket b(\mathbf{x}) = c \rrbracket \cdot G_c(\mathbf{x})$$

function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$)
  if termination criteria met
    return base hypothesis $g_t(\mathbf{x})$
  else
    (1) learn branching criteria $b(\mathbf{x})$
    (2) split $\mathcal{D}$ to $C$ parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$
    (3) build sub-tree $G_c \leftarrow \text{DecisionTree}(\mathcal{D}_c)$
    (4) return $G(\mathbf{x}) = \sum_{c=1}^{C} \llbracket b(\mathbf{x}) = c \rrbracket \cdot G_c(\mathbf{x})$

four choices: number of branches, branching
criteria, termination criteria, & base hypothesis




Classification and Regression Tree (C&RT) 


function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$)
  if termination criteria met
    return base hypothesis $g_t(\mathbf{x})$
  else ...
    (2) split $\mathcal{D}$ to $C$ parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$

two simple choices
• $C = 2$ (binary tree)
• $g_t(\mathbf{x})$ = $E_{\text{in}}$-optimal constant
  • binary/multiclass classification (0/1 error): majority of $\{y_n\}$
  • regression (squared error): average of $\{y_n\}$







disclaimer: 
C&RT here is based on selected components 
of CART™ of California Statistical Software 
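
The two $E_{\text{in}}$-optimal constants are short to compute. A minimal Python sketch with hypothetical function names; how ties between equally frequent labels are broken is an assumption here (the first label seen wins), since the slides do not specify it:

```python
from collections import Counter


def optimal_constant_classification(ys):
    """Majority label: the constant with the smallest 0/1 error on ys."""
    return Counter(ys).most_common(1)[0][0]


def optimal_constant_regression(ys):
    """Average: the constant with the smallest squared error on ys."""
    return sum(ys) / len(ys)


leaf_label = optimal_constant_classification(["spam", "ham", "spam"])
leaf_value = optimal_constant_regression([1.0, 2.0, 6.0])
```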


Branching in C&RT: Purifying 


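function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$)
  if termination criteria met
    return base hypothesis $g_t(\mathbf{x})$ = $E_{\text{in}}$-optimal constant
  else ...
    (1) learn branching criteria $b(\mathbf{x})$
    (2) split $\mathcal{D}$ to 2 parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$

more simple choices
• simple internal node for $C = 2$: $\{1, 2\}$-output decision stump
• ‘easier’ sub-tree: branch by purifying

$$b(\mathbf{x}) = \underset{\text{decision stumps } h(\mathbf{x})}{\operatorname{argmin}} \sum_{c=1}^{2} |\mathcal{D}_c \text{ with } h| \cdot \text{impurity}(\mathcal{D}_c \text{ with } h)$$
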





C&RT: bi-branching by purifying 
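
Branching by purifying can be sketched as an exhaustive search over stumps. The Python below is one possible reading under stated assumptions: the impurity is taken to be the Gini index (1 minus the sum of squared class fractions), candidate thresholds are midpoints between consecutive distinct feature values, and the names `gini` and `best_stump` are hypothetical.

```python
from collections import Counter


def gini(ys):
    """Assumed impurity: 1 minus the sum of squared class fractions."""
    n = len(ys)
    return 1.0 - sum((count / n) ** 2 for count in Counter(ys).values())


def best_stump(X, y, impurity=gini):
    """Return (score, feature index, threshold) minimizing
    sum over both parts of |D_c| * impurity(D_c), or None if no stump splits the data."""
    best = None
    for i in range(len(X[0])):
        values = sorted(set(row[i] for row in X))
        # Assumed candidate thresholds: midpoints between distinct values,
        # so both parts are always non-empty.
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            part_le = [yn for row, yn in zip(X, y) if row[i] <= theta]
            part_gt = [yn for row, yn in zip(X, y) if row[i] > theta]
            score = len(part_le) * impurity(part_le) + len(part_gt) * impurity(part_gt)
            if best is None or score < best[0]:
                best = (score, i, theta)
    return best


X = [[1, 5], [2, 5], [3, 5], [4, 5]]
y = [0, 0, 1, 1]
result = best_stump(X, y)
```

Weighting each part's impurity by its size keeps a tiny, perfectly pure part from dominating the choice.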


Impurity Functions 


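by $E_{\text{in}}$ of optimal constant

• regression error:
  $$\text{impurity}(\mathcal{D}) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \bar{y})^2$$
  with $\bar{y}$ = average of $\{y_n\}$
• classification error:
  $$\text{impurity}(\mathcal{D}) = \frac{1}{N} \sum_{n=1}^{N} \llbracket y_n \neq y^* \rrbracket$$
  with $y^*$ = majority of $\{y_n\}$

for classification

• Gini index:
  $$1 - \sum_{k=1}^{K} \left( \frac{\sum_{n=1}^{N} \llbracket y_n = k \rrbracket}{N} \right)^2$$
  (all $k$ considered together)
• classification error:
  $$1 - \max_{1 \le k \le K} \frac{\sum_{n=1}^{N} \llbracket y_n = k \rrbracket}{N}$$
  (optimal $k = y^*$ only)
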





popular choices: Gini for classification, 
regression error for regression 
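
The three impurity functions translate almost line by line into code. A minimal Python sketch with hypothetical function names, each taking the list of labels $\{y_n\}$ in one part of the data:

```python
from collections import Counter


def regression_error(ys):
    """Mean squared deviation from the average of ys."""
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys) / len(ys)


def classification_error(ys):
    """Fraction of labels that differ from the majority label."""
    return 1.0 - max(Counter(ys).values()) / len(ys)


def gini_index(ys):
    """1 minus the sum of squared class fractions (all classes considered together)."""
    n = len(ys)
    return 1.0 - sum((count / n) ** 2 for count in Counter(ys).values())


labels = [1, 1, 1, 2]
err = classification_error(labels)   # 1 - 3/4
gini = gini_index(labels)            # 1 - (9/16 + 1/16)
mse = regression_error([1.0, 3.0])   # deviations of 1 around the mean 2
```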




Termination in C&RT 


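function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$)
  if termination criteria met
    return base hypothesis $g_t(\mathbf{x})$ = $E_{\text{in}}$-optimal constant
  else ...
    (1) learn branching criteria
        $$b(\mathbf{x}) = \underset{\text{decision stumps } h(\mathbf{x})}{\operatorname{argmin}} \sum_{c=1}^{2} |\mathcal{D}_c \text{ with } h| \cdot \text{impurity}(\mathcal{D}_c \text{ with } h)$$

‘forced’ to terminate when
• all $y_n$ the same: impurity = 0 $\Rightarrow$ $g_t(\mathbf{x}) = y_n$
• all $\mathbf{x}_n$ the same: no decision stumps
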





C&RT: fully-grown tree with constant leaves 
that come from bi-branching by purifying 
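
The two forced-termination conditions amount to a simple check on the data reaching a node. A small Python sketch, where the function name `cannot_branch` and the list-of-lists data layout are assumptions for illustration:

```python
def cannot_branch(X, y):
    """True when the node is forced to become a leaf."""
    all_y_same = all(label == y[0] for label in y)   # impurity is already 0
    all_x_same = all(row == X[0] for row in X)       # no stump can separate the examples
    return all_y_same or all_x_same


pure = cannot_branch([[1, 2], [3, 4]], ["a", "a"])
identical_inputs = cannot_branch([[1, 2], [1, 2]], ["a", "b"])
splittable = cannot_branch([[1, 2], [3, 4]], ["a", "b"])
```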


Fun Time 


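For the Gini index, $1 - \sum_{k=1}^{K} \left( \frac{\sum_{n=1}^{N} \llbracket y_n = k \rrbracket}{N} \right)^2$, consider $K = 2$, and let
$\mu = \frac{N_1}{N}$, where $N_1$ is the number of examples with $y_n = 1$. Which of the
following formulas of $\mu$ equals the Gini index in this case?

(1) $2\mu(1 - \mu)$
(2) $2\mu^2(1 - \mu)$
(3) $2\mu(1 - \mu)^2$
(4) $2\mu^2(1 - \mu)^2$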







Reference Answer: (1) 


Simplify $1 - (\mu^2 + (1 - \mu)^2)$ and the answer
should pop up.
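
For reference, the simplification can be written out step by step, using only the fact that with $K = 2$ the two class fractions are $\mu$ and $1 - \mu$:

```latex
\begin{aligned}
1 - \left( \mu^2 + (1 - \mu)^2 \right)
  &= 1 - \mu^2 - \left( 1 - 2\mu + \mu^2 \right) \\
  &= 2\mu - 2\mu^2 \\
  &= 2\mu(1 - \mu)
\end{aligned}
```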
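Decision Tree: Decision Tree Heuristics in C&RT

Basic C&RT Algorithm

function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$)
  if cannot branch anymore
    return $g_t(\mathbf{x})$ = $E_{\text{in}}$-optimal constant
  else
    (1) learn branching criteria
        $$b(\mathbf{x}) = \underset{\text{decision stumps } h(\mathbf{x})}{\operatorname{argmin}} \sum_{c=1}^{2} |\mathcal{D}_c \text{ with } h| \cdot \text{impurity}(\mathcal{D}_c \text{ with } h)$$
    (2) split $\mathcal{D}$ to 2 parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$
    (3) build sub-tree $G_c \leftarrow \text{DecisionTree}(\mathcal{D}_c)$
    (4) return $G(\mathbf{x}) = \sum_{c=1}^{2} \llbracket b(\mathbf{x}) = c \rrbracket \cdot G_c(\mathbf{x})$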


easily handle binary classification, 
regression, & multi-class classification 
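
Putting the pieces together, the following Python is a compact sketch of the basic algorithm for classification, not a reference implementation of CART. Assumptions: Gini impurity, midpoint thresholds, a dictionary-based tree, and the hypothetical names `gini`, `majority`, `learn_stump`, `build_tree`, and `predict`.

```python
from collections import Counter


def gini(ys):
    n = len(ys)
    return 1.0 - sum((count / n) ** 2 for count in Counter(ys).values())


def majority(ys):
    return Counter(ys).most_common(1)[0][0]


def learn_stump(X, y):
    """Step (1): stump minimizing size-weighted impurity; None if no stump exists."""
    best = None
    for i in range(len(X[0])):
        values = sorted(set(row[i] for row in X))
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            part_le = [yn for row, yn in zip(X, y) if row[i] <= theta]
            part_gt = [yn for row, yn in zip(X, y) if row[i] > theta]
            score = len(part_le) * gini(part_le) + len(part_gt) * gini(part_gt)
            if best is None or score < best[0]:
                best = (score, i, theta)
    return best


def build_tree(X, y):
    # Forced termination: all labels the same, or no stump can split the inputs.
    stump = None if len(set(y)) == 1 else learn_stump(X, y)
    if stump is None:
        return {"leaf": majority(y)}
    _, i, theta = stump
    # Step (2): b(x) = [[x_i <= theta]] + 1, so branch 2 holds x_i <= theta.
    parts = {1: ([], []), 2: ([], [])}
    for row, yn in zip(X, y):
        c = 2 if row[i] <= theta else 1
        parts[c][0].append(row)
        parts[c][1].append(yn)
    # Step (3): build one sub-tree per part; step (4): return the combined tree.
    return {"feature": i, "theta": theta,
            "subtrees": {c: build_tree(*parts[c]) for c in (1, 2)}}


def predict(tree, x):
    while "leaf" not in tree:
        c = 2 if x[tree["feature"]] <= tree["theta"] else 1
        tree = tree["subtrees"][c]
    return tree["leaf"]


# XOR-like data that no single stump can separate.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
tree = build_tree(X, y)
fitted = [predict(tree, x) for x in X]
```

Since every input in this toy set is distinct, the fully grown tree fits the training labels exactly; regression would only need the leaf constant and impurity swapped.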


Regularization by Pruning 


fully-grown tree: $E_{\text{in}}(G) = 0$ if all $\mathbf{x}_n$ different
but overfit (large $E_{\text{out}}$) because low-level trees built with small $\mathcal{D}_c$

• need a regularizer, say, $\Omega(G) = \text{NumberOfLeaves}(G)$
• want regularized decision tree:
  $$\underset{\text{all possible } G}{\operatorname{argmin}} \; E_{\text{in}}(G) + \lambda \Omega(G)$$
  (called pruned decision tree)
• cannot enumerate all possible $G$ computationally; often consider only
  • $G^{(0)}$ = fully-grown tree
  • $G^{(i)} = \operatorname{argmin}_{G} E_{\text{in}}(G)$ such that $G$ is one-leaf removed from $G^{(i-1)}$

systematic choice of $\lambda$? validation
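
Once the candidate sequence $G^{(0)}, G^{(1)}, \ldots$ is available, picking the pruned tree for a given $\lambda$ is a one-line minimization. The Python sketch below covers only that selection step; the function name `select_pruned` is hypothetical, and the `(E_in, number of leaves)` pairs are made-up illustrative values rather than results from any experiment.

```python
def select_pruned(candidates, lam):
    """candidates[i] = (E_in of G^(i), number of leaves of G^(i)).
    Returns the index i minimizing E_in(G^(i)) + lam * Omega(G^(i))."""
    return min(
        range(len(candidates)),
        key=lambda i: candidates[i][0] + lam * candidates[i][1],
    )


# Made-up sequence: each pruning step removes leaves and raises E_in.
candidates = [(0.00, 8), (0.02, 7), (0.05, 5), (0.15, 3), (0.40, 1)]

no_penalty = select_pruned(candidates, lam=0.0)      # keeps the fully-grown tree
mild_penalty = select_pruned(candidates, lam=0.02)
heavy_penalty = select_pruned(candidates, lam=1.0)   # collapses to a single leaf
```

In practice $\lambda$ itself would be chosen by validation, for example by comparing the validation error of the tree selected for each candidate value.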


Branching on Categorical Features 


numerical features
blood pressure: 130, 98, 115, 147, 120

branching for numerical: decision stump
$b(\mathbf{x}) = \llbracket x_i \le \theta \rrbracket + 1$ with $\theta \in \mathbb{R}$

categorical features
major symptom: fever, pain, tired, sweaty

branching for categorical: decision subset
$b(\mathbf{x}) = \llbracket x_i \in S \rrbracket + 1$ with $S \subset \{1, 2, \ldots, K\}$


C&RT (& general decision trees): 
handles categorical features easily 
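
Both branching criteria share the same shape: an indicator plus one. A minimal Python sketch, assuming hypothetical factory names `decision_stump` and `decision_subset` and examples represented as lists indexed by feature:

```python
def decision_stump(i, theta):
    """Numerical branching: returns 2 when x[i] <= theta, else 1."""
    return lambda x: int(x[i] <= theta) + 1


def decision_subset(i, S):
    """Categorical branching: returns 2 when x[i] is in S, else 1."""
    S = frozenset(S)
    return lambda x: int(x[i] in S) + 1


# Feature 0: blood pressure (numerical); feature 1: major symptom (categorical).
by_pressure = decision_stump(0, 120)
by_symptom = decision_subset(1, {"fever", "sweaty"})

patient = [115, "pain"]
pressure_branch = by_pressure(patient)
symptom_branch = by_symptom(patient)
```

With integer-coded categories, `decision_subset(i, {1, 6})` sends types 1 and 6 to the second sub-tree and everything else to the first.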


Missing Features by Surrogate Branch 
possible $b(\mathbf{x}) = \llbracket \text{weight} \le 50\text{kg} \rrbracket$

if weight missing during prediction:
• what would human do?
  • go get weight
  • or, use threshold on height instead, because
    threshold on height ≈ threshold on weight
• surrogate branch:
  • maintain surrogate branch $b_1(\mathbf{x}), b_2(\mathbf{x}), \ldots \approx$ best branch $b(\mathbf{x})$
    during training
  • allow missing feature for $b(\mathbf{x})$ during prediction by using surrogate
    instead











C&RT: handles missing features easily 
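
One way to read "surrogate ≈ best branch" is to rank candidate branches by how often they agree with the best branch on the training data. The Python below is a sketch under that assumption; the agreement measure, the function names, and the toy weight/height data are all hypothetical.

```python
def agreement(b_main, b_other, X):
    """Fraction of training examples on which two branching functions agree."""
    return sum(b_main(x) == b_other(x) for x in X) / len(X)


def pick_surrogate(b_main, candidates, X):
    """Training time: keep the candidate that best mimics the main branch."""
    return max(candidates, key=lambda b: agreement(b_main, b, X))


def branch_with_surrogate(x, main_feature, b_main, b_surrogate):
    """Prediction time: fall back to the surrogate if the main feature is missing."""
    if x.get(main_feature) is None:
        return b_surrogate(x)
    return b_main(x)


# Toy training data in which both features are observed.
X_train = [
    {"weight": 45, "height": 150},
    {"weight": 48, "height": 155},
    {"weight": 60, "height": 170},
    {"weight": 72, "height": 180},
]

b_weight = lambda x: int(x["weight"] <= 50) + 1
height_160 = lambda x: int(x["height"] <= 160) + 1
height_175 = lambda x: int(x["height"] <= 175) + 1

surrogate = pick_surrogate(b_weight, [height_175, height_160], X_train)
branch = branch_with_surrogate({"height": 152}, "weight", b_weight, surrogate)
```

Keeping an ordered list of surrogates instead of a single one would cover the case where the surrogate's own feature is missing too.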


Fun Time 


Consider a categorical branching criterion $b(\mathbf{x}) = \llbracket x_i \in S \rrbracket + 1$ with
$S = \{1, 6\}$. Which of the following explains the criterion?
(1) if i-th feature is of type 1 or type 6, branch to first sub-tree; else
    branch to second sub-tree
(2) if i-th feature is of type 1 or type 6, branch to second sub-tree;
    else branch to first sub-tree
(3) if i-th feature is of type 1 and type 6, branch to second sub-tree;
    else branch to first sub-tree
(4) if i-th feature is of type 1 and type 6, branch to first sub-tree; else
    branch to second sub-tree













Reference Answer: (2) 


Note that ‘$\in S$’ is an ‘or’-style condition on the
elements of $S$ in human language.
Decision Tree: Decision Tree in Action

A Simple Data Set

[Figure: a simple data set; panel titles: C&RT, AdaBoost-Stump]


C&RT: ‘divide-and-conquer’ 


A Complicated Data Set 





[Figure: a complicated data set; surviving panel title: AdaBoost-Stump]


C&RT: even more efficient than
AdaBoost-Stump


Practical Specialties of C&RT 


• human-explainable
• multiclass easily
• categorical features easily
• missing features easily
• efficient non-linear training (and testing)

almost no other learning model shares all such specialties,
except for other decision trees





another popular decision tree algorithm: 
C4.5, with different choices of heuristics 


Fun Time 


Which of the following is not a specialty of C&RT without pruning? 
(1) handles missing features easily
(2) produces explainable hypotheses
(3) achieves low $E_{\text{in}}$
(4) achieves low $E_{\text{out}}$







Reference Answer: (4)


The first two choices are easy; the third comes
from the fact that fully-grown C&RT greedily
minimizes $E_{\text{in}}$ (almost always to 0). But as you
may imagine, overfitting may happen and $E_{\text{out}}$
may not always be low.
Summary

(1) Embedding Numerous Features: Kernel Models
(2) Combining Predictive Features: Aggregation Models

Lecture 9: Decision Tree
• Decision Tree Hypothesis
  express path-conditional aggregation
• Decision Tree Algorithm
  recursive branching until termination to base
• Decision Tree Heuristics in C&RT
  pruning, categorical branching, surrogate
• Decision Tree in Action
  explainable and efficient

• next: aggregation of aggregation?!

(3) Distilling Implicit Features: Extraction Models





